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Abstract 

ANOVA decompositions are a standard method for describing and estimating heterogeneity 
among the means of a response variable across levels of multiple categorical factors. In such 
a decomposition, the complete set of main effects and interaction terms can be viewed as a 
collection of vectors, matrices and arrays that share various index sets defined by the factor levels. 
For many types of categorical factors, it is plausible that an ANOVA decomposition exhibits 
some consistency across orders of effects, in that the levels of a factor that have similar main- 
effect coefficients may also have similar coefficients in higher-order interaction terms. In such a 
case, estimation of the higher-order interactions should be improved by borrowing information 
from the main effects and lower-order interactions. To take advantage of such patterns, this 
article introduces a class of hierarchical prior distributions for collections of interaction arrays 
that can adapt to the presence of such interactions. These prior distributions are based on 
a type of array-variatc normal distribution, for which a covariance matrix for each factor is 
estimated. This prior is able to adapt to potential similarities among the levels of a factor, and 
incorporate any such information into the estimation of the effects in which the factor appears. 
In the presence of such similarities, this prior is able to borrow information from well-estimated 
main effects and lower-order interactions to assist in the estimation of higher-order terms for 
which data information is limited. 

Key Words: array- valued data; Bayesian estimation; cross-classified data; factorial design; MAN OVA 
penalized regression; tensor; Tucker product; sparse data. 
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1 Introduction 



Cross-classified data are prevalent in many disciplines, including the social and health sciences. For 
example, a survey or observational study may record health behaviors of its participants, along 
with a variety of demographic variables such as age, ethnicity and education level, by which the 
participants can be classified. A common data analysis goal in such settings is the estimation of the 
health behavior means for each combination of levels of the demographic factors. In a three-way 
layout, for example, the goal is to estimate the three-way table of population cell means, where 
each cell corresponds to a particular combination of factor levels. A standard estimator of the table 
is provided by the table of sample means, which can alternatively be represented by its ANOVA 
decomposition into additive effects and two- and three-way interaction terms. 

The cell sample means provide an unbiased estimator of the population means, as long as there 
are observations available for each cell. However, if the cell-specific sample sizes are small then 
it may be desirable to share information across the cells to reduce the variance of the estimator. 
Perhaps the simplest and most common method of information sharing is to assume that certain 
mean contrasts among levels of one set of factors are equivalent across levels of another set of 
factors, or equivalently, that certain interaction terms in the ANOVA decomposition of population 
cell means are exactly zero. This is a fairly large modeling assumption, and can often be rejected 
via plots or standard F-tests. If such assumptions are rejected, it still may be desirable to share 
information across cell means, although perhaps in a way that does not posit exact relationships 
among them. 

As a concrete example, consider estimating mean macronutrient intake across levels of age, 
ethnicity and education from the National Health and Nutrition Examination Survey (NHANES). 
Table [T] summarizes the cell-specific sample sizes for intake of overall carbohydrates as well as two 
subcategories (sugar and fiber) by age, ethnicity, and education levels for male respondents (more 
details on these data are provided in Section 5). Studies of carbohydrate intake have been motivated 



by a frequently cited relationship between carbohydrate intake and health outcomes (Chandalia 



et al. 2000 Moerman et al. 1993). However, these studies generally report on marginal means 



of carbohydrate intake across demographic variables, and do not take into account potential non- 



additivity, or interaction terms, between them (Park et al. 2011 Montonen et al. 2003). A more 
detailed understanding of the relationship between mean carbohydrate intake and the demographic 
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Table 1: Cross-tabulation of the demographic variables. 

variables can be obtained from a MANOVA decomposition of the means array into main-effects, 
two- and three-way interactions. Evidence for interactions can be assessed with approximate i^-tests 
based on the Pillai trace statistics (Olson, 1976), presented in Table [2j 
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Table 2: Testing MANOVA variance components via Pillai's trace statistic. 

The .F-tests indicate strong evidence that the two- and three-way interactions are not zero. 
Based on these results, standard practice would be to retain the full model and describe the in- 
teraction patterns via various contrasts of sample cell means. However, OLS estimates of these 
interactions typically have undesirably high mean squared error (MSE) , due to the small number of 
observations for each two- or three-way combination of factors. A variety of penalized least squares 
procedures have been proposed in order to reduce MSE, such as ridge regression and the lasso. 
Recent variants of these approaches allow for different penalties on ANOVA terms of different or- 



ders, including the ASP method of Beran (2005), and grouped versions of the lasso (Yuan and Lin 
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Figure 1: Plots of main effects and interaction correlations for the three outcome variables (car- 
bohydrates, sugar and fiber). The first row of plots gives OLS estimates of the main effects for 
each factor. The second row of plots gives correlations of effects between levels of each factor, with 
white representing 1 and black representing -1. 



2007 Friedman et al. 2010). Corresponding Bayesian approaches include Bayesian lasso proce- 



dures (Yuan and Lin 2005; Genkin et al. 2007 Park and Casella 2008) and multilevel hierarchical 



priors (Pittau et al., 2010 Park et al., 2006). 



While these procedures attain a reduced MSE by shrinking linear model coefficient estimates 
towards zero, they do not generally take full advantage of the structure that is often present in 
cross-classified datasets. In the data analysis example above, two of the three factors (age and 
education) are ordinal, with age being a binned version of a continuous predictor. Considering 
factors such as these more generally, suppose a categorical factor x is a binned version of some 
underlying continuous or ordinal explanatory variable x (such as income, age, number of children 
or education level). If the mean of the response variable y is smoothly varying in the underlying 
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variable x, we would expect that adjacent levels of the factor x would have similar main effects 
and interaction terms. Similarly, for non-ordinal factors (such as ethnic group or religion) it is 
possible that two levels represent similar populations, and thus may have similar main-effects and 
interaction terms as well. We refer to such similarities across the orders of the effects as order 
consistent interactions, 

Returning to the NHANES data, Figure [I] summarizes the OLS estimates of the main effects 
and two-way interactions for the three outcome variables (carbohydrates, sugar and fiber). Not 
surprisingly, the main effects for the ordinal factors (age and education) are "smooth," in that 
the estimated main effect for a given level is generally similar to the effect for an adjacent level. 
Additionally, some similarities among the ethnic groups appear consistent across the three outcome 
variables. To assess consistency of such similarities between main effects and two-way interactions, 
we computed correlations of parameter estimates for these effects between levels of each factor. 
For example, there are 3 x 10 = 30 main-effect and two-way interaction estimates involving each 
level of age: For each of the three outcome variables, there is 1 main-effect estimate for each age 
level, 4 estimates from the age by ethnicity interaction and 5 estimates from the age by education 
interaction. We compute a correlation matrix for the five levels of age based on the resulting 30 x 5 
matrix of parameter estimates, and similarly compute correlations among levels of ethnicity and 
among levels of education. The second row of Figure [l] gives grayscale plots of these correlation 
matrices. The results suggest some degree of order consistent interactions: For the ordinal factors, 
the highest correlations are among adjacent pairs. For the ethnicity factor, the results suggest that 
on average, the effects for the Mexican category are more similar to the Hispanic category than to 
the other ethnic categories, as we might expect. 

The OLS estimates of the main effects and two-way interactions presented above, along with the 
fact that two of the three factors are ordinal, suggest the possibility of order consistent interactions 
among the array of population cell means. More generally, order consistent interactions may be 
present in a variety of datasets encountered in the social and health sciences, especially those 
that include ordinal factors, or factors for which some of the levels may represent very similar 
populations. In this paper, we propose a novel class of hierarchical prior distributions over main 
effects and interaction arrays that can adapt to the presence of order consistent interactions. The 
hierarchical prior distribution provides joint estimates of a covariance matrix for each factor, along 
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with the factor main effects and interactions. Roughly speaking, the covariance matrix for a given 
factor is estimated from the main effects and interactions in which the factor appears. Conversely, 
an estimate of a factor's covariance matrix can assist in the estimation of higher-order interactions, 
for which data information is limited. We make this idea more formal in the next section, where 
we construct our prior distribution from a set of related array normal distributions with separable 



covariance structures (Hoff, 2011), and provide a Markov chain Monte Carlo algorithm for inference 
under this prior. In Section 3 we provide a simulation study comparing estimation under our 
proposed prior to some standard estimators. As expected, our approach outperforms others when 
the data exhibit order consistent interactions. Additionally, for data lacking any interactions, our 
approach performs comparably to the OLS estimates obtained from the additive model (i.e. the 
oracle estimator). In Section 4 we extend this methodology to MANOVA models in order to 
analyze the multivariate NHANES data presented above. In addition to estimates of main effects 
and interactions, our analysis provides measures of similarity between levels of each of the factors. 
We conclude in Section 5 with a summary of our approach and a discussion of possible extensions. 



2 A hierarchical prior for interaction arrays 

In this section we introduce the hierarchical array (HA) prior, and present a Markov chain Monte 
Carlo (MCMC) algorithm for posterior approximation and parameter estimation. The HA prior 
is constructed from several semi-conjugate priors, and so the MCMC algorithm can be based on a 
straightforward Gibbs sampling scheme. 

2.1 The hierarchical array prior 

For notational convenience we consider the case of three categorical factors, and note that the 
HA prior generalizes trivially to accommodate a greater number of factors. Suppose the three 
categorical factors have levels {1, . . . , mi}, {1, . . . , 1112} and {1, . . . , 771,3} respectively. The standard 
AN OVA model for a three-way factorial dataset is 

Vijki = n + ai + bj+ c k + (ab)^ + (ac) lk + (bc) jk + (abc) ijk + e ijk i (1) 
{eijkl} ~ i.i.d. normal(0,CT 2 ). 
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Let a denote the mi x 1 vector of main effects for the first factor, (ab) denote the mi x 771,2 matrix 
describing the two-way interaction between the first two factors, (abc) denote the mi x 7772 x 777,3 
three-way interaction array, and let b, c, (ac), and (be) be defined similarly. Bayesian inference 
for this model proceeds by specifying a prior distribution for the ANOVA decomposition 9 = 
{/j, a, b, c, (ab), (ac), (be), (abc)} and the error variance a 2 . 

As described in the Introduction, if two levels of a factor represent similar populations, we 
would expect that coefficients of the decomposition involving these two levels would have similar 
values. For example, suppose levels i\ and %2 of the first factor correspond to similar populations. 
We might then expect to be close to aj 2 , the vector {(ab)i lt j, j = 1, . . . , 7712} to be close to the 
vector {(ab)i 2 j,j = I, ... ,1712}, and so on. We represent this potential similarity between levels 
of the first factor with a covariance matrix £ a , and consider a prior distribution on the ANOVA 
decomposition such that 

Cov[a] = E[aa T ] = S a , 
E[(ab)(abf] = k ab ^ a , 
E[(ac)(ac) T ] = k ac E a , 



E[(a6c)(i)(a6c)(" 1) ] = k ab j: a , 



where k ab , k ac and k abc are scalars. Here, (abc)^ is the matricization of the array (abc), which 
converts the mi x 7772 x 7773 array into an mi x (7772777,3) matrix ( Kolda and Baderj 2009). To 



accommodate similar structure for the second and third factors, we propose the following prior 
covariance model for the main effects and interaction terms: 

Cov[a] = S a Cov[6] = S fe Cov[c] = S c 

Cov[vec(a6)] = S fe (g> S a /7ab Cov[vec(6c)] = S c (g> T, h /^ bc Cov[vec(ac)] = S c ® S a /7ac 

Cov[vec(a6c)] = S c <g> S 6 <g> S a /7abc, 

where "0" is the Kronecker product. The covariance matrices T, a , £& and S c represent the simi- 
larities between the levels of each of the three factors, while the scalars j a b, lac-, 76c ) labc represent 
the relative (inverse) magnitudes of the interaction terms as compared to the main effects. Further 
specifying the priors on the ANOVA decomposition parameters as being mean-zero and Gaussian, 
the prior on a is then the multivariate normal distribution iV mi (0, E a ), and the prior on vec(ab) is 



iV mim2 (0, E& (g) ^a/lab)- This latter distribution is sometimes referred to as a matrix normal dis- 
tribution ( |Dawid 1981). Similarly, the prior on vec(a6c) is N mini2m3 (0, E c ® E& <8) E a /7 a f, c ), which 



has been referred to as an array normal distribution (Hoff, 2011). 

In most data analysis situations the similarities between the levels of a given factor and mag- 
nitudes of the interactions relative to the main effects will not be known in advance. We there- 
fore consider a hierarchical prior so that E , E&, E c and the 7-parameters are estimated from 
the data. Specifically, we use independent inverse- Wishart prior distributions for each covariance 
matrix, e.g. E a ~ inverse- Wishart (r/ a o, S~ ) and gamma priors for the 7-parameters, e.g. ^ab ~ 
gamma(f a feo/2, T^ fe0 /2), where rj a , S a , v a bo and T 2 b0 are hyperparameters to be specified (some de- 
fault choices for these parameters are discussed at the end of this section). This hierarchical prior 
distribution can be viewed as an adaptive penalty, which allows for sharing of information across 
main effects and interaction terms: For example, any similarities between levels of the first factor 
that are consistent across the main effects and two-way interactions will be reflected in the estimate 
of E a , which in turn assists in the estimation of the three-way interaction for which there is limited 
data information. 



2.2 Posterior approximation 

Due to the semi-conjugacy of the HA prior, posterior approximation can be obtained from a straight- 
forward Gibbs sampling scheme. Under this scheme, iterative simulation of parameter values from 
the corresponding full conditional distributions generates a Markov chain having a stationary dis- 
tribution equal to the target posterior distribution. For computational simplicity, we consider the 
case of a balanced dataset in which the sample size in each cell is equal to some common value n, 
in which case the data can be expressed as an m\ x m<i x 7713 x n four-way array Y . A modification 
of the algorithm to accommodate unbalanced data is discussed in the next subsection. 

Derivation of the full conditional distributions of the grand mean fi and the error variance a 2 are 
completely standard: Under a N(^iq,Tq) prior for /u, the corresponding full conditional distribution 
is iV(/ii, T-j 2 ), where r 2 = (1/tq +nmim2m3/o 2 )~ 1 and fii = t 2 ([1o/tq -\-nm1m2m3f /a 2 ), where f = 
T,ijki(yijki-[ a i+bj+Ck+(ab)ij+(ac)i k +(bc) jk +(abc) ikj ])/n Under an inverse-gamma(i/ /2, ^o<?"o/2) 
prior distribution, the full conditional distribution of a 2 is an inverse-gamma (z^i /2, v\ a\ /2j distri- 
bution, where v\ = u + nmim 2 m, 3 , v\a\ = v^o 2 + Yjijki iVijkl ~ Vijk) 2 and ^ ijk = H + Oj + bj + 
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Cfc + (ab)ij + (ac)jfe + (bc)jk + (abc)^. Derivation of the full conditional distributions of param- 
eters other than \i and a 2 is straightforward, but slightly non-standard due to the use of matrix 
and array normal prior distributions for the interaction terms. In what follows, we compute the 
full conditional distributions for a few of these parameters. Full conditional distributions for the 
remaining parameters can be derived in an analogous fashion. 

Full conditionals of a and (abc): To identify the full conditional distribution of the vector a of 
main effects for the first factor, let 

r ijk i = Vijki ~ (m + bj +c k + (ab)^ + (ac) ik + (bc) jk + (abc) ijk J 

i.e., r^ki is the "residual" obtained by subtracting all effects other than a from from the data. Since 
{^ijki} ~ i-i-d- normal(0, a 2 ), we have 

o f 1712171311 rp T-\l 

p(Y\6,a ) oc a exp| (a a - 2a r)| , 

where f = (n, . . . , f mi ) with fj = Yjjki r ijki/ (m 2 m 3 n), 6 = {a, b, c, (ab), (ac), (6c), (abc)} and "oc a " 
means "proportional to as a function of a." Combining this with the iV mi (0, S a ) prior distribution 
for a, we have 

p{a\Y, 6- a , a ) oc a exp I [a a - 2a r\ - -a S a a I 

and so the full conditional distribution of a is multivariate normal with 

Var[a|y,fl_ a ,a 2 ] = (S" 1 + Im 2 m z n/ a 2 )' 1 

E[a\Y, 8- a ,<J 2 ] = (S^ 1 + Im 2 m 3 n/a 2 ) 1 f x (m 2 m 3 n/a 2 ), 

where / is the mi x mi identity matrix. 

Derivation of the full conditional distributions for the interaction terms is similar. For example, 
to obtain the full conditional distribution of (abc) let r^ki be the residual obtained after subtracting 
all other components of 6 from the data, so that r^ki = (abc)- k + e^. Let f be the three-way 
array of cell means of {r^ki}, so that = ^[fijki/n. Combining the likelihood in terms of f 
with the N mim2Tn3 (0, S c (g) £& (g) X a /7 abc ) prior density for vec(aftc) gives 

p{(abc) |Y,<7 2 ,£ a ,£ 6 ,£ c ,7 abc ,0_( abc) ) oc (a6c) exp (-^ vec (afrc) T vec (abc) - 2vec (a6c) T vec(f) ) 

exp ^ _ ^ vec (abc) T (S c S b ® Sc/Tabc)" 1 vec (abcj^j 



and so vec(abc) has a multivariate normal distribution with variance and mean given by 
Y&r[vec(abc)\Y,9_ {abc) ,a 2 ] = ^ c (g>J: b (g)T la /-f abc y 1 +In/a 2 ^ 
E[vec{abc)\Y, 9_ {abc) ,a 2 } = ((S c <g> S fe ® S a / 7(l6c )- 1 + /n/a 2 ) 1 vec(f) x n/a 2 . 

Full conditional distributions for the remaining effects can be derived analogously. 

Full conditional of S a : The parameters in the ANOVA decomposition whose priors depend on 
S a are a, (ab), (ac) and (abc). For example, the conditional density of (ab) given T, a , T, b and 7a & 
can be written as 

p((ab)\E a , E b , j ab ) = |27rS 6 ® £ a / 7 abr 1/2 exp (-vec(a6) T [S b ® £ a /7 a6 ] _1 vec(a&)/2) 
a Sa |S a r m2 / 2 etr (-S~ 1 7a5 (a6) T S 6 - 1 (a5)/2) 
= iS.I-^^etrC-S- 1 ^^), 

where S ab = 7 a fe(a6) T E^" 1 (a6) and etr(^4) = exp{trace(^4)} for a square matrix A. Similarly, the 
priors for a, (ac) and (abc) involve the following terms: 

S a = aa T 

Sac = 7ac(ac)S~ 1 (ac) T 
Sabc = labc(ahc) ( i ) (T, c ®T lb y 1 (abcjf 1) . 

Multiplying together the prior densities for a, (ab), (ac) and (abc) and the inverse-Wishart(?7 a o, S~q) 
prior density for S a gives 

p(£ a |0,£ b ,£ c , 7 ) oc | Sa |-(i+-i+'fco+i+m 2 +m3+ m2 m3)/2 etr (- S - 1 (5 a0 + 5 a + 5^ + 5 ac + 5^/2) . 

It follows that the full conditional distribution of S a is inverse- Wishart (?7ai , S~^) where rj a i = 
%o + (l + m 2 + + vn,2"mj) and S^i = S a o + S a + S ab + S ac + S abc . The full conditional expectation 
of E a is therefore 5 a i/ (rf a \ — mi — 1), which combines several estimates of the similarities among 
the levels of the first factor, based the main effects and the interactions. 

Full conditional of j abc : The full conditional distribution of j abc depends only on the (abc) 
interaction term. The normal prior for (abc) can be written as 

p((abc)\E a , S 6 , S c , 7 abc ) oc 7ai)c 12™ 2 ™' 3 ' 1 ' 1 exp{-7 afec vec(a6c) T [S c S a ]~ 1 vec(a6c) T /2} 
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Combining this density with a gamma(V a & c o/2, T 2 bc0 /2) prior density yields a full conditional for 
7 a6c that is gamma(i/ 6 c i/2,T^ 6cl /2), where 

Vabd = v abc0 + mim 2 m 3 

T lbc\ = T IbcO + vec(a6c) T [S c ® S fc <g> E a ] -1 vec(a&c). 
2.3 Balancing unbalanced designs 

For most survey data we expect the sample sizes {ra^} to vary across combinations of factors. As a 
result, the full conditional distributions of the ANOVA decomposition parameters are more difficult 
to compute. For example, the conditional variance of the three-way interaction vec(a6c) changes 
from (^~f a bc (S c (g) <S> S a ) _1 + In/a 2 ^ in the balanced case to ^7 a fe c (S c ® ® S a ) _1 + D/a 2 ^j 
in the general case, where -D is a diagonal matrix with diagonal elements vec({ny&}). Even for 
moderate numbers of levels of the factors, the matrix inversions required to calculate the full 
conditional distributions in the unbalanced case can slow down the Markov chain considerably. 
As an alternative, we propose the following data augmentation procedure to "balance" an un- 
balanced design. Let Y° be the three-way array of cell means based on the observed data, i.e. 
Vijk = ~Ylyijki/ n ijk- Letting n = max ({n^}), for each cell ijk with sample size < n and at 
each step of the Gibbs sampler, we impute a cell mean based on the "missing" n — observations 
as Vijk ~ normal a 2 /[n 

max — n ijk])i where is the population mean for cell ijk based on the 
current values of the ANOVA decomposition parameters. We then combine y°- k and y^ k to form 
the "full sample" cell mean = {nijkV°j k + (n — nijk)y^- k )/n. This array of cell means provides 
the sufficient statistics for a balanced dataset, for which the full conditional distributions derived 
above can be used. 



2.4 Setting hyperparameters 

In the absence of detailed prior information about the parameters, we suggest using a modified 
empirical Bayes approach to hyperparameter selection based on the maximum likelihood estimates 
(MLEs) of the error variance and mean parameters. Priors for jj, and a 2 can be set as unit in- 



formation priors (Kass and Wasserman, 1995), whereby hyperparameters are chosen so that the 
prior means are near the MLEs but the prior variances are set to correspond roughly to only one 
observation's worth of information. For the covariance matrices S a , and S c , recall that the prior 
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for the main effect a of the first factor is N mi (0, E a ). Based on this, we choose the prior for T, a to be 
inverse-Wishart(f a o, -S^ 1 ) with u a o = m\ + 2 and 5 a o = \a\ 2 I mi /mi, where a is the MLE of a. Under 
this prior, E[tr(S a )] = |d| 2 , and so the scale of the prior matches the empirical estimates. Finally, 
the 7-parameters can be set analogously, using diffuse gamma priors but centered around values to 
match the magnitude of the OLS estimates of the interaction terms they correspond to, relative to 
the magnitude of the main effects. For example, in the next section we use a gamma(f a fco/2, 
prior for j a f, in which v q = 1 and r 2 b(j = |d| 2 |6| 2 /| (ab)\ 2 , where d, b and (ab) are the OLS estimates. 



3 Simulation study 

This section presents the results of three simulation studies comparing the HA prior to several 
competing approaches. The first simulation study uses data generated from a means array that 
exhibits order consistent interactions. Estimates based on the HA prior outperform standard OLS 
estimates as well as estimates from a standard Bayesian approach that is similar to the one in 



Gelman (2005), and is also related to a grouped version of the lasso procedure (Yuan and Lin 
2006 ) . The second simulation study uses data from a means array that exhibits "order inconsistent" 
interactions, i.e. interactions without consistent similarities in parameter values between levels of 
a factor. In this case the HA prior still outperforms the OLS and standard Bayes approaches, 
although not by as much as in the presence of order consistent interactions. The third simulation 
study uses data from a means array that has an exact additive decomposition, i.e. there are no 
interactions. The HA prior procedure again outperforms the standard Bayes and OLS approaches, 
although it does not do as well as OLS and Bayes oracle estimators that assume the correct additive 
model. 



3.1 Data with order consistent interactions 

The data in this simulation study is generated from a model where the means array exhibits order 
consistent interactions. The dimensions of the means array M were chosen to be mi x m<i x 771,3 = 
15 x 7 x 3, which could represent, for example, the number of categories we might have for age, 
education level and political affiliation in a cross-classified survey dataset. The means array was 
generated from a cubic function of three variables that was then binned. Figure [2] plots the mean 
array across the third factor, demonstrating the nonadditivity present in M. By decomposing M 
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factor 1 factor 1 factor 1 

Figure 2: The means array M across levels of the third factor. 

into the main, two-way and three-way effects in the same manner as described in Section [2j we 
can summarize the nonadditivity of M through the magnitudes of the different sums of squares. 
The magnitudes of the main effects ||a|| 2 /mi, \\b\\ 2 /m,2, and ||c|| 2 /m% are 5.267, 0.012, 0.004 respec- 
tively. Those of the two-way interactions ||a6|| 2 /(rn\m,2), ||ac|| 2 /{rn\mz), and ||6c|| 2 /(m<2 m 3) are 
1.365, 1.312, and 0.384, and the magnitude of the three-way interaction ||a6c|| 2 / (mim^m^) is 0.474. 
For each sample size n G {400, 1000, 5000, 10000}, we simulated 50 datasets using the mean array M 
and independent standard normal errors. In order to make a comparison to OLS possible, we first 
allocated one observation to each cell of the means array and then distributed the remaining obser- 
vations uniformly, leading to a complete but potentially unbalanced design. The average number 
of observations per cell under the sample sizes {400, 1000, 5000, 10000} was {1.3, 3.2, 15.9, 31.7}. 

For each simulated dataset we obtained estimates under the HA prior (using the hyperparam- 
eter specifications described in Section 2.4), as well as ordinary least squares estimates (OLS) and 
posterior estimates under a standard Bayesian prior (SB). The SB approach is essentially a sim- 
plified version of the HA prior in which the parameter values are conditionally independent given 
the hyperparameters: {di} ~ i.i.d. A^(0,tr 2 ), {{ab)^} ~ i.i.d. A r (0,a 2 fc ) and {(afrc) ijfc } ~ i.i.d. 
(0, c^ bc ) and similarly for all other main effects and interactions. To facilitate comparison to the 
HA prior, the hyperpriors for these <7 2 -parameters are the same as the hyperpriors for the inverses 
of the 7-parameters in the HA approach. As a result, this standard Bayes prior can be seen as the 
limit of a sequence of HA priors where the inverse- Wishart prior distributions for the S-matrices 
converge to point-masses on the identity matrices of the appropriate dimension. 
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For each simulated dataset, the Gibbs sampler described in Section [2] was run for 11,000 itera- 
tions, the first 1,000 of which were dropped to allow for convergence to the stationary distribution. 
Parameter values were saved every 10th scan, resulting in 1,000 Monte Carlo samples per simulation. 
Starting values for all the mean effects were set to zero and all variances set to identity matrices of 
the proper dimensions. We examined the convergence and autocorrelation of the marginal samples 
of the error variance a 2 using Geweke's z-test and the effective sample size. The minimum effective 
sample size across all simulations was 233 out of the 1000 recorded scans, and the average effective 
sample size was 895. Geweke's ^-statistic was less than 2 in absolute value in 93, 93, 97, and 95 
percent of the Markov chains for the four sample sizes (with the percentages being identical for 
both Bayesian methods). While the cases in which \z\ > 2 were not extensively examined, it is 
assumed that running the chain longer would have yielded improved estimation. 

For each simulated data set, the posterior mean estimates Mha and Mge were obtained by 
averaging their values across the 1,000 saved iterations of the Gibbs sampler. The average squared 
error (ASE) of estimation was calculated as ASE(M) = \\M - M\\ 2 /(mi 17121113) where M is the 
means array that generated the data. These values were then compared across the three approaches. 
The left panel of Figure [3] demonstrates that the SB estimator provided a reduction in ASE when 
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Figure 3: Comparison of ASE for different estimation methods. 



compared to the OLS estimator for all data sets with sample sizes 400 and 1000, 96% of the data sets 
with sample size 5000 and 90% of data sets with sample size 10000. The second panel demonstrates 
that the HA estimator provides a substantial further reduction in ASE for all data sets. As we 
would expect, the reduction in ASE is dependent on the sample size and decreases as the sample 
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size increases. 

These results are not surprising: By estimating the variances a 2 a , o\ h , etc. from the data, the SB 
approach provides adaptive shrinkage and so we expect these SB estimates to outperform the OLS 
estimates in terms of ASE. However, the SB approach does not use information on the similarity 
among the levels of an effect, and so its estimation of higher order interactions relies on the limited 
information available directly in the corresponding sufficient statistics. As such, we expect the SB 
estimates to perform less well than the HA estimates, which are able to borrow information from 
well-estimated main effects and low-order interactions to assist in the estimation of higher-order 
terms for which data information is limited. Additionally, recall that the parameters in the mean 
array M were generated by binning a third-degree polynomial, and were not generated from array 
normal distributions, i.e. the HA prior is "incorrect" as a model for M. Even so, the HA prior 
is able to capture the similarities between adjacent factor levels, resulting in improved estimation. 
However, we note that not all of the improvement in ASE achieved by the HA prior should be 
attributed to the identification of order-consistent interactions. The simulation study that follows 
suggests some of the performance of the HA prior is due to additional parameter shrinkage provided 
by the inverse- Wishart distributions on the S-matrices 

3.2 Data with order inconsistent interactions 



level 1 of factor 3 level 2 of factor 3 level 3 of factor 3 




1 3 5 7 9 11 13 15 1 3 5 7 9 11 13 15 1 3 5 7 9 11 13 15 

factor 1 factor 1 factor 1 



Figure 4: The means array M for the second simulation study, across levels of the third factor. 

In this subsection we evaluate the HA approach for populations which exhibit interactions that 
are order inconsistent. The means array M is constructed by taking the means array from Section 
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3.1 decomposing it into main effects, two- and three-way interactions, permuting the levels of each 
factor within each effect, and reconstructing a means array. That is, if {aj : i = 1,. . . , mi} is the 
collection of parameters for the first main effect and |(a6)^ : i = 1 . . . ,mx,j = 1, . . . ,m2| is the 
collection of parameters for the two way interaction between the first and second factors in Section 
|3.l| then {« 7ri (i)} an d |( a ^)7r 2 (i)7r 3 (j)} are ^ ne ma i n effect and two-way interaction parameters for 
the means array in this section, where tti,tt2 and its are independent permutations. The remaining 
effects were permuted analogously. Due to this construction, the magnitudes of the main effects, 
two- and three-way interactions remain the same, but the process becomes less "smooth," as can 
be seen in Figure |4| 

Again, 50 data sets were generated for each sample size, and estimates MhAj -^sb arid Mols 
were obtained for each data set, where the Bayesian estimates were obtained using the same MCMC 
approximation procedure as in the previous subsection. Figure [5] compares ASE across the different 
approaches. The left panel of Figure [5j as with the left panel of Figure [3j demonstrates that the SB 
estimator provides a reduction in ASE when compared to the OLS estimator. As expected, since 
neither of these approaches take advantage of the structure of the order consistent interactions, 
this plot is nearly identical to the corresponding plot in Figure [3} 

The second panel demonstrates that the HA estimator provides a further reduction in ASE for 
all data sets, although this reduction is less substantial than in the presence of order consistent 
interactions. The lower ASE of the HA estimates may be initially surprising, as there are no order 
consistent interactions for the HA prior take advantage of. We conjecture that the lower ASE is 
due to the additional shrinkage on the parameter estimates that the inverse- Wishart priors on the 
S-parameters provide. For example, under both the SB and HA priors we have Cov[vec(o6)] = 
Sfe (g) I^a/jab, but under the former the covariance matrices are set to the identity whereas under 
the latter they have inverse- Wishart distributions. 

3.3 Data without interactions 

In this subsection we evaluate the HA approach for populations in which interactions are not 
present. In addition to the SB and OLS estimators, we compare the HA approach to two "oracle" 
estimators: the additive model least squares estimator (AOLS) and Bayes estimator under the 
additive model (ASB). The prior used by the ASB approach is the same as the SB prior, but does 
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Figure 5: Comparison of ASE for different estimation methods. 



not include terms other than main effects in the model. 

As before, 50 data sets were generated for each sample size, and estimates Mha, Msbi -^ols> 
Masb and Maols were obtained for each data set, where the Bayesian estimates were obtained 
using the same MCMC approximation procedure as in the previous two subsection. Some results 
are shown in Figure |6j which compares ASE across the different approaches. In the top row of 
Figure [6] we see that the performance of the HA estimates is comparable to but not as good as the 
the oracle least squares and Bayesian estimates in terms of ASE. Specifically, the ASE for the HA 
estimates is 24.2, 18.6, 20.1 and 17.4 percent higher than for the AOLS estimates for data sets with 
sample sizes 400, 1000, 5000 and 10000 respectively. Similarly, the ASE for the HA estimates is 25, 
19.7, 20.3 and 17.8 percent higher than for the ASB estimates for data sets with sample sizes 400, 
1000, 5000 and 10000 respectively. However, the bottom row of Figure [6] shows that the HA prior is 
superior to the other non-oracle OLS and SB approaches that attempt to estimate the interaction 
terms. 

These results, together with those of the last two subsections, suggest that the HA approach 
provides a competitive method for fitting means arrays in the presence or absence of interactions. 
When order consistent interactions are present, the HA approach is able to make use of the sim- 
ilarities across levels of the factors, thereby outperforming approaches that cannot adapt to such 
patterns. Additionally, the HA approach does not appear to suffer when interactions are not or- 
der consistent. In the absence of interactions altogether, the HA approach adapts well, providing 
estimates similar to those that assume the correct additive model. 
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Figure 6: Comparison of ASE for different estimation methods. 

4 Analysis of carbohydrate intake 

In this section we estimate average carbohydrate, sugar and fiber intake by education, ethnicity 
and age using the HA procedure described in Section [2j Our estimates are based on data from 
2134 males from the US population, obtained from the 2007-2008 NHANES survey. Nutrient intake 
is self reported on two non-consecutive days. Each day's data include food and beverage intake 
information from the preceding 24 hour period only, and is calculated using the USDA's Food and 



Nutrient Database for Dietary Studies 4.1 (USDA, 2010). All intake was measured in grams, and 



we average the intake over the two days to yield a single measurement per individual. We are 
interested in relating the intake data to the following demographic variables: 

• Age: (31 - 40) , (41 - 50) , (51 - 60) , (61 - 70) , (71 - 80). 

• Education: Primary (P), Secondary (S), High School diploma (HD), Associates degree (AD), 
Bachelors degree (BD). 
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Figure 7: Two-way plots of the transformed data. 

• Ethnicity: Mexican (not Hispanic), Hispanic, white (not Hispanic) and black (not Hispanic). 

Sample sizes for age-education-ethnicity combinations were presented in Table [T] in Section 1 . Of 
the 2234 male respondents within the above demographic groups, 100 were missing their nutrient 
intake information for both days, with similar rates of missingness across the demographic variables. 
For the purpose of this example analysis, we treat this information as missing at random. 

The data on the original scale are somewhat skewed and show heteroscedasticity across the 
demographic variables. Since different variances across groups can lead to bias in the sums of squares 



estimates, making F-tests for interactions anti-conservative (Miller and Brown 1997), stabilizing 



the variance is desirable. Figure [7] provides two-way scatterplots of the response variables after 
applying a quarter power transformation to each variable, which we found stabilized the variances 
across the groups better than either a log or square-root transformation. Additionally, we centered 
and scaled each response variable to have mean zero and variance one. 



4.1 MANOVA model and parameter estimation 

As presented in Table [T] of Section 1, F-tests indicate evidence for the presence of interactions in the 
array of population cell means. However, 12% of all age-education-ethnicity categories have sample 
sizes less than 5, and so we are concerned with overfitting of the OLS estimates. As an alternative, 
we extend the HA methodology described in Section [2] to accommodate a MANOVA model. Our 
MANOVA model has the same form as the ANOVA model given by Equation Q, except that each 
effect listed there is a three-dimensional vector corresponding to the separate effects for each of the 
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three response variables. Additionally, the error terms now have a multivariate normal distribution 
with zero- mean and unknown covariance matrix E^. 

We extend the hierarchical array prior discussed above to accommodate the p-variate MANOVA 
model as follows: Our prior for the mi x p matrix a of main effects for the first factor is vec(a) ~ 
N mip (0,I <S> S a ), where I is the p x p identity matrix. Our prior for the mi x m-2 x p array (ab) 
of two-way interaction terms is given by vec(aft) ~ A mim2P (0, r~ 6 (g> ® S a ). Here, is a p x p 
diagonal matrix whose terms determine the scale of the two-way interactions for each of the p 
response variables. Similarly, our prior for the four-way array (abc) of three-way interaction terms 
is vec(a6c) ~ N mim2m3P (0, 

^ abc ® ^ c ® ® ^a)- Priors for other main effects and interaction 
terms are defined similarly. The hyperpriors for each diagonal entry of T are independent gamma 
distributions, chosen as in Section 2.4 so that the prior magnitude of the effects for each response 
is centered around the sum of squares of the effect from the OLS decomposition. 

A Gibbs sampling scheme similar to the one outlined in Section [2] was iterated 200,000 times 
with parameter values saved every 10 scans, resulting in 20,000 simulated values of the means array 
M and the covariance matrices {S e th, S age , S e d u } for posterior analysis. Mixing of the Markov chain 
for M was good: Figure [8] shows MCMC samples of 4 out of 320 entries of M (chosen so that their 
trace plots were visually distinct). The autocorrelation across the saved scans was low, with the 
lag-10 autocorrelation for the thinned chain being less than 0.14 in absolute value for each element 
of M (97.3% of entries have lag-10 autocorrelation less than 0.07 in absolute value) and effective 
sample sizes between 1929 and 13520. The mixing for the elements of the covariance matrices £ e th> 
E age , S e du was not as good as that of the means array M: The maximum absolute value of lag-10 
autocorrelation of the saved scans for the three rescaled covariance matrices was 0.18, 0.12, and 
0.19 respectively. The effective sample sizes for the elements of the covariance matrices were at 
least 1684. 

4.2 Posterior inference for M and the Ss 

We obtain a Monte Carlo approximation to M = E [M\Y] by averaging over the saved scans of the 
Gibbs sampler. Figure [9] provides information on the shrinkage and regularization of the estimates 
due to the HA procedure, as compared to OLS. The first panel plots the difference between the 
OLS and Bayes estimates of the cell means versus cell-specific sample sizes. For small sample sizes, 
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Figure 8: MCMC samples of 4 out of 320 entries of the means array M. 

the Bayes estimate for a given cell is affected by the data from related cells, and can generally be 
quite different from the OLS estimate (the cell sample mean). For cells with large sample sizes the 
difference between the two estimates is generally small. The second panel of the figure plots the 
OLS estimates of the cell means for carbohydrate intake of black survey participants across age and 
education levels. Note that there appears to be a general trend of decreasing intake with increasing 
age and education level, although the OLS estimates themselves are not consistently ordered in this 
way. In contrast, these trends are much more apparent in the Bayes estimates plotted in the third 
panel. The HA prior allows the parameter estimates to be close to additive, while not enforcing 
strict additivity in this situation where we have evidence of non-additivity via the F-tests. 



The first row of Figure 10 provides the estimates of the main effects from the HA procedure. 
The second row of Figure 10 summarizes covariance matrices {S et h, S age , £ e du} via the posterior 
mean estimates of the correlation matrices {C^r/} — {^d,ij/ \f^d,ii^- i d.jj 

} for d G {eth, age, edu}. In 

this figure, the diagonal elements are all 1, and darker colors represent a greater departure from one. 
The range of the estimated correlations was -0.34 to 0.42 for age categories, -0.30 to 0.35 for ethnic 
groups, and -0.37 to 0.38 for educational categories. For the two ordered categorical variables, 
age and education, we see that closer categories are generally more positively correlated than ones 
that are further apart. While the ethnicity variable is not ordered, its correlation matrix informs 
us of which categories are more similar in terms of these response variables. The middle panel of 
the second row of Figure 10 confirms the order-consistent interactions we observed in Figure [Tj 



Mexican survey participants are more similar to Hispanic participants in terms of carbohydrate 
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Figure 9: Shrinkage and regularization plots. The first panel plots the difference between the OLS 
and HA estimates of a cell- mean against the cell-specific sample sizes. The second a third panels 
plot estimated cell-means for black survey participants across age and education levels, where lighter 
shades represent higher levels of education. 

intake patterns than to white or black participants. 

5 Discussion 

This article has presented a novel hierarchical Bayes method for parameter estimation of cross- 
classified data under ANOVA and MANOVA models. Unlike least-squares estimation, a Bayesian 
approach provides for regularized estimates of the potentially large number of parameters in a 
MANOVA model. Unlike the non-hierarchical Bayesian approach, the hierarchical approach pro- 
vides a data-driven method of regularization, and unlike the standard hierarchical Bayes, the hi- 
erarchical array prior can identify similarities among categories and share this information across 
interaction effects to assist in the estimation of higher-order terms for which data information is 
limited. In a simulation study the HA approach was able to detect interactions when they were 
present, and to estimate the means array better than a full least squares or standard Bayesian 
approaches (in terms of mean squared error). When the true means array was completely additive, 
the HA prior was able to adapt to this smaller model better than the other full model estimation 
approaches under consideration. 

Generalizations of the HA prior are applicable to any model whose parameters consist of vectors, 
matrices and arrays for which some of the index sets are shared. This includes generalized linear 
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Figure 10: Plots of main effects and interaction correlations for the three outcome variables (car- 
bohydrates, sugar and fiber). The first row of plots gives HA estimates of the main effects for each 
factor. The second row of plots gives correlations of effects between levels of each factor, with white 
representing 1 and black representing -1. 

models with categorical factors, as well as ANCOVA models that involve interactions between 
continuous and categorical explanatory variables. As an example of the latter case, suppose we 
are interested in estimating the linear relationship between an outcome and a set of explanatory 
variables for every combination of three categorical factors. The regression parameters then consist 
of an mi x mi x 771,3 x V array, where 771-1,7772,7773 are the numbers of factor levels and p is the 
number of continuous regressors. The usual ANCOVA decomposition can be used to parametrize 
this array in terms of main effects and interactions arrays, for which a hierarchical array prior may 
be used. 

Computer code and data for the results in Sections [3] and [4] are available at the authors' websites: 



www . stat . Washington . edu/~volf , www . stat . Washington . edu/~hof f 
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